The majority of Americans are already suffering from heart disease:
About 90% of the risk of a first heart attack is attributable to preventable conditions or lifestyle factors.
Lifestyle risk factors
There is plenty of room for improvement in raising awareness of cardiovascular risk: a recent study found that one in five adults at risk for heart disease does not recognize a need to improve their health. We believe that a better understanding of the risk factors underlying health perceptions and behaviors is needed to make the most of cardiovascular prevention efforts.
We sought to examine the risk factors for heart disease and their prevalence in New York City. We were also interested in novel ways of visualizing the correlation of individual risk factors with heart disease and stroke, drawing on what we learned in the Data Science class. In addition, we wanted to find ways of predicting and visualizing an individual’s risk for heart disease. As such, our questions were as follows:
To address the first two questions, we used the 500 Cities: Local Data for Better Health dataset. The 500 Cities Project is a collaboration between the CDC, the Robert Wood Johnson Foundation, and the CDC Foundation. Its purpose is to provide city- and census tract-level small area estimates for chronic disease risk factors, health outcomes, and clinical preventive service use for the 500 largest cities in the United States. These small area estimates allow cities and local health departments to better understand the burden and geographic distribution of health-related variables in their jurisdictions, and help them plan public health interventions. Since we were interested only in New York City, we filtered the dataset down to New York City records.
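Risk factors that we’re interested in (15 total): Annual Checkup, Binge Drinking, Cholesterol Screening, Chronic Kidney Disease, Current Smoking, Diabetes, Health Insurance, High Blood Pressure, High Cholesterol, Mental Health, Obesity, Physical Health, Physical Inactivity, Sleep <7 hours, Taking BP Medication
Outcomes that we’re interested in (2 total): Coronary Heart Disease, Stroke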
To predict an individual's risk for heart disease, we used the Framingham Risk Score, a gender-specific algorithm for estimating an individual's 10-year cardiovascular risk. It was first developed from data collected in the Framingham Heart Study to estimate the 10-year risk of developing coronary heart disease. We used the algorithm to build a visualization in a Shiny app, as discussed below.
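To make the shape of such a score concrete, here is a minimal sketch of one published Framingham model: the 2008 general cardiovascular disease version (D'Agostino et al.), which is a sex-specific Cox model on log-transformed inputs. This is an assumption for illustration only. The text above does not say which Framingham version our app uses, the function and argument names are hypothetical, and the coefficients should be checked against the original publication before any real use.

```r
# Sketch of the 2008 Framingham general CVD risk model (assumed version;
# verify coefficients against the publication before relying on them).
framingham_coef <- list(
  male = c(age = 3.06117, tc = 1.12370, hdl = -0.93263,
           sbp_untreated = 1.93303, sbp_treated = 1.99881,
           smoker = 0.65451, diabetes = 0.57367),
  female = c(age = 2.32888, tc = 1.20904, hdl = -0.70833,
             sbp_untreated = 2.76157, sbp_treated = 2.82263,
             smoker = 0.52873, diabetes = 0.69154)
)

# Baseline 10-year survival and mean linear predictor for each sex
framingham_base <- list(
  male = c(s0 = 0.88936, mean_lp = 23.9802),
  female = c(s0 = 0.95012, mean_lp = 26.1931)
)

# Hypothetical helper: returns the estimated 10-year risk as a proportion.
# total_chol and hdl are in mg/dL, sbp is systolic blood pressure in mmHg.
framingham_risk <- function(sex, age, total_chol, hdl, sbp,
                            bp_treated = FALSE, smoker = FALSE,
                            diabetes = FALSE) {
  sex <- match.arg(sex, c("male", "female"))
  b <- framingham_coef[[sex]]
  base <- framingham_base[[sex]]

  # The blood pressure coefficient depends on treatment status
  sbp_coef <- if (bp_treated) b[["sbp_treated"]] else b[["sbp_untreated"]]

  # Linear predictor on log-transformed continuous inputs
  lp <- b[["age"]] * log(age) +
    b[["tc"]] * log(total_chol) +
    b[["hdl"]] * log(hdl) +
    sbp_coef * log(sbp) +
    b[["smoker"]] * smoker +
    b[["diabetes"]] * diabetes

  # Cox model: risk = 1 - S0 ^ exp(lp - mean lp)
  1 - base[["s0"]]^exp(lp - base[["mean_lp"]])
}

risk_example <- framingham_risk("male", age = 60, total_chol = 200,
                                hdl = 50, sbp = 130)
```

A function of this shape drops naturally into a Shiny server function: each input widget supplies one argument, and the returned proportion can be formatted as a percentage for display.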
We downloaded the data directly from the CDC open data portal, as shown in the code below, and called it cvrisk.
library(tidyverse)

cvrisk_url = "https://data.cdc.gov/api/views/6vp6-wxuq/rows.csv?accessType=DOWNLOAD"
## Read in data from the CDC data portal
cvrisk =
  read.csv(url(cvrisk_url)) %>%
  janitor::clean_names() %>%
  as_tibble()
We then restricted our dataset to New York City census tracts in 2016 and kept only the variables needed for data visualization. We called this dataset nyc_cvrisk.
nyc_cvrisk =
  cvrisk %>%
  filter(state_desc == "New York", city_name == "New York",
         geographic_level == "Census Tract",
         !is.na(data_value),
         year == 2016,
         measure_id %in% c("ACCESS2", "BINGE", "BPHIGH", "BPMED", "OBESITY", "CHECKUP",
                           "CHOLSCREEN", "CSMOKING", "DIABETES", "HIGHCHOL", "KIDNEY",
                           "LPA", "MHLTH", "PHLTH", "SLEEP", "CHD", "STROKE")) %>%
  droplevels() %>%
  select(unique_id, population_count, measure_id, data_value, short_question_text)
We first created another dataset for the animated scatterplot, in which the coronary heart disease outcome appears as a separate variable alongside each risk factor.
## Risk factors only (one row per census tract and risk factor)
nyc_cvrisk_limited = nyc_cvrisk %>%
  filter(short_question_text %in% c("Annual Checkup", "Binge Drinking", "Cholesterol Screening",
                                    "Chronic Kidney Disease", "Current Smoking", "Diabetes",
                                    "Health Insurance", "High Blood Pressure", "High Cholesterol",
                                    "Mental Health", "Obesity", "Physical Health",
                                    "Physical Inactivity", "Sleep <7 hours", "Taking BP Medication")) %>%
  droplevels() %>%
  mutate(risk_factor = short_question_text) %>%
  select(-short_question_text, -measure_id)
## Coronary heart disease prevalence only (one row per census tract)
nyc_cad = nyc_cvrisk %>%
  filter(short_question_text == "Coronary Heart Disease") %>%
  droplevels() %>%
  select(-measure_id, -short_question_text, -population_count)
## After the join, data_value.x is the risk factor prevalence and data_value.y is the CHD prevalence
nyc_cvrisk_joined = left_join(nyc_cvrisk_limited, nyc_cad, by = "unique_id")
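Because both tables carry a column named data_value, the join appends dplyr's default suffixes, which is where the data_value.x and data_value.y names used in the plot come from. The toy example below illustrates this with made-up values for two hypothetical tracts; the suffix argument shown at the end is an optional alternative, not what the analysis above uses.

```r
library(dplyr)

# Made-up long-format data: two hypothetical tracts, one risk factor plus CHD
toy <- data.frame(
  unique_id = c("t1", "t2", "t1", "t2"),
  short_question_text = c("Obesity", "Obesity",
                          "Coronary Heart Disease", "Coronary Heart Disease"),
  data_value = c(30.1, 22.4, 6.2, 4.8)
)

# Risk factor rows, with the label renamed as in the analysis
toy_risk <- toy %>%
  filter(short_question_text != "Coronary Heart Disease") %>%
  rename(risk_factor = short_question_text)

# Outcome rows, keeping only the join key and the value
toy_chd <- toy %>%
  filter(short_question_text == "Coronary Heart Disease") %>%
  select(unique_id, data_value)

# Default suffixes: data_value.x (left table) and data_value.y (right table)
toy_joined <- left_join(toy_risk, toy_chd, by = "unique_id")

# Optional: explicit suffixes give self-describing column names
toy_joined_named <- left_join(toy_risk, toy_chd, by = "unique_id",
                              suffix = c("_risk", "_chd"))
```

With explicit suffixes the plotting code could map x to data_value_chd and y to data_value_risk, which is easier to read than the positional .x and .y names.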
We then installed the gganimate package, which is required for the animation:
devtools::install_github('thomasp85/gganimate')
We then created an animated scatterplot with coronary heart disease prevalence on the x-axis and the prevalence of each risk factor on the y-axis.
library(ggplot2)
library(gganimate)

theme_set(theme_bw()) # pre-set the bw theme.

## using transition_length and state_length
## (transition_states() drives the animation, so no frame aesthetic is needed)
p = nyc_cvrisk_joined %>%
  ggplot(aes(x = data_value.y, y = data_value.x)) +
  geom_point(aes(size = population_count, colour = risk_factor),
             alpha = 0.5) +
  xlim(0, 10) +
  labs(title = "{closest_state}",
       x = 'Coronary Heart Disease Prevalence',
       y = 'Risk Factor Prevalence',
       colour = 'Risk Factors',
       size = 'Population Count') +
  theme(plot.title = element_text(size = 40, face = "bold"),
        axis.text = element_text(size = 18),
        axis.title = element_text(size = 18, face = "bold")) +
  theme(legend.text = element_text(size = 16),
        legend.title = element_text(size = 18, face = "bold")) +
  # gganimate parts: show one risk factor at a time
  transition_states(risk_factor, transition_length = 1, state_length = 3, wrap = TRUE) +
  enter_fade() +
  exit_fade()
animate(p, fps = 1, height = 600, width = 1000, renderer = gifski_renderer())
To make the correlation plot, we first transformed our dataset into wide format, with one row per census tract and one column per measure, and called it nyc_cvrisk_wide.
nyc_cvrisk_wide =
  nyc_cvrisk %>%
  select(-measure_id) %>%
  spread(key = short_question_text, value = data_value) %>%
  janitor::clean_names()
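As a small illustration of this long-to-wide step, the sketch below reshapes made-up values for two hypothetical tracts. It uses tidyr's pivot_wider(), the newer replacement for spread(); this is a swapped-in equivalent for illustration, not the call used in the analysis above.

```r
library(tidyr)

# Made-up long-format data: one row per tract and measure
toy_long <- data.frame(
  unique_id = c("t1", "t1", "t2", "t2"),
  short_question_text = c("Obesity", "Stroke", "Obesity", "Stroke"),
  data_value = c(30.1, 3.0, 22.4, 2.1)
)

# Wide format: one row per tract, one column per measure
toy_wide <- pivot_wider(toy_long,
                        names_from = short_question_text,
                        values_from = data_value)
```

Each measure becomes its own numeric column, which is the layout cor() expects when computing pairwise correlations across tracts.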
The correlation plot for coronary heart disease and risk factors of interest is as follows:
nyc_cvrisk_wide %>%
  select(annual_checkup:sleep_7_hours) %>%
  # move the outcome to the first row and column of the plot
  select("coronary_heart_disease", everything()) %>%
  rename("CHD" = coronary_heart_disease,
         "Annual Checkup" = annual_checkup,
         "Binge Drinking" = binge_drinking,
         "Kidney Disease" = chronic_kidney_disease,
         "Current Smoking" = current_smoking,
         "Diabetes" = diabetes,
         "No Insurance" = health_insurance,
         "Poor Mental Health" = mental_health,
         "Obesity" = obesity,
         "Poor Health" = physical_health,
         "Physical Inactivity" = physical_inactivity,
         "Poor Sleep" = sleep_7_hours) %>%
  cor() %>%
  corrplot::corrplot(method = "circle")
The correlation plot for stroke and risk factors of interest is as follows:
nyc_cvrisk_wide %>%
  select(annual_checkup:stroke) %>%
  # move the outcome to the first row and column of the plot
  select(stroke, everything()) %>%
  select(-c(coronary_heart_disease)) %>%
  rename("Stroke" = stroke,
         "Annual Checkup" = annual_checkup,
         "Binge Drinking" = binge_drinking,
         "Kidney Disease" = chronic_kidney_disease,
         "Current Smoking" = current_smoking,
         "Diabetes" = diabetes,
         "No Insurance" = health_insurance,
         "Poor Mental Health" = mental_health,
         "Obesity" = obesity,
         "Poor Health" = physical_health,
         "Physical Inactivity" = physical_inactivity,
         "Poor Sleep" = sleep_7_hours) %>%
  cor() %>%
  corrplot::corrplot(method = "circle")
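A correlation plot shows every pair at once; when the question is which risk factors track an outcome most closely, it can help to pull out just the outcome's row of the correlation matrix and sort it. The sketch below does this on synthetic data (the variable names and the simulated relationships are invented for illustration and say nothing about the New York City results).

```r
# Synthetic tract-level data, for illustration only
set.seed(42)
n <- 200
smoking <- rnorm(n, mean = 15, sd = 4)
obesity <- rnorm(n, mean = 25, sd = 6)
checkup <- rnorm(n, mean = 75, sd = 5)
# Simulated outcome that depends on smoking and, more weakly, on obesity
chd <- 2 + 0.2 * smoking + 0.05 * obesity + rnorm(n, mean = 0, sd = 1)

toy_measures <- data.frame(chd, smoking, obesity, checkup)

# Pairwise Pearson correlations; pairwise.complete.obs tolerates missing values
cor_mat <- cor(toy_measures, use = "pairwise.complete.obs")

# Correlations of each risk factor with the outcome, strongest first
chd_cor <- sort(cor_mat["chd", colnames(cor_mat) != "chd"], decreasing = TRUE)
```

The same two lines at the end would apply to the matrix produced by cor() in the pipelines above, using the renamed outcome column ("CHD" or "Stroke") as the row to extract.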
We used Shiny and the leaflet package to visualize the geographic distribution of several risk factors and outcomes across New York City.
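The general pattern for such a map is a Shiny input that selects a measure and a leaflet output that redraws for the selection. The sketch below is a minimal, hypothetical version of that pattern, not our actual app: the coordinates and prevalence values are randomly generated stand-ins, and the data frame and input names are assumptions.

```r
library(shiny)
library(leaflet)

# Randomly generated stand-in data: hypothetical tract centroids and values
set.seed(1)
toy_tracts <- data.frame(
  lon = runif(50, min = -74.05, max = -73.75),
  lat = runif(50, min = 40.55, max = 40.90),
  measure = rep(c("Obesity", "Diabetes"), each = 25),
  value = runif(50, min = 5, max = 40)
)

ui <- fluidPage(
  selectInput("measure", "Measure", choices = unique(toy_tracts$measure)),
  leafletOutput("map")
)

server <- function(input, output, session) {
  output$map <- renderLeaflet({
    # Keep only the rows for the measure chosen in the dropdown
    d <- toy_tracts[toy_tracts$measure == input$measure, ]
    # Continuous colour scale over the selected measure's values
    pal <- colorNumeric("YlOrRd", domain = d$value)

    leaflet(d) %>%
      addTiles() %>%
      addCircleMarkers(lng = ~lon, lat = ~lat, color = ~pal(value),
                       radius = 5, stroke = FALSE, fillOpacity = 0.7) %>%
      addLegend(pal = pal, values = ~value, title = input$measure)
  })
}

# Build the app object; launch it interactively with runApp(app)
app <- shinyApp(ui, server)
```

With real data, the circle markers would typically be replaced by census tract polygons via addPolygons(), which requires tract boundary geometry that is not part of the columns kept in nyc_cvrisk.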
We also used Shiny to build an application that calculates an individual's risk of heart disease using the Framingham Risk Score.